Analysis for Twain Works through Module 7 (Language Models, NLP, Vector Space Models, Similarity and Clustering, PCA)

DS 5001: Exploratory Text Analytics

Cecily Wolfe (cew4pf)

Spring 2022

M03: Language Models

Create Training Vocab ($V_{train}$)

Generate Training Sentences

Generate and Count Ngrams

n-gram token table

Unigram table

Bigram table

Trigram table
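The unigram, bigram, and trigram count tables above can be sketched with plain Python. This is a minimal stand-in, not the course's langmod code; the sample token list and `<s>`/`</s>` sentence markers are illustrative assumptions:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams as tuples of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy sentence with illustrative start/end markers
tokens = "<s> the cat sat on the mat </s>".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

In a real corpus the same counting is done per sentence and the counts are accumulated into the vocabulary tables.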

Create language model

Generate a text with the .generate_text() method of the langmod.NgramLanguageModel object (model)

Examining redundancy for unigrams, bigrams, and trigrams $\rightarrow$ redundancy increases as n grows

Using the bigram model represented as a matrix (the full matrix is too large to build with BGX = model.LM[1].n.unstack(), so the method below is used instead), explore the relationships between bigram pairs, using the following lists as the first and second words of the bigrams of interest
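The bigram machinery behind .generate_text() can be sketched as follows. These are hypothetical minimal stand-ins for langmod's NgramLanguageModel, not its actual implementation: conditional probabilities are estimated by maximum likelihood, and text is sampled word by word from P(next | current). The toy sentence and `<s>`/`</s>` markers are illustrative assumptions:

```python
import random
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Estimate P(w2 | w1) by maximum likelihood from bigram counts."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in counts.items()}

def generate_text(model, start="<s>", end="</s>", max_len=20, seed=0):
    """Sample a sentence by repeatedly drawing the next word from P(. | current)."""
    rng = random.Random(seed)
    out, w = [], start
    while w != end and len(out) < max_len:
        nxt = model[w]
        w = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if w != end:
            out.append(w)
    return " ".join(out)

model = bigram_model("<s> the cat sat on the mat </s>".split())
sentence = generate_text(model)
```

Unstacking such a model into a matrix puts first words on the rows and second words on the columns, which is what the bigram-pair exploration below reads off.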

M05: Vector Space Models

Zipf's Law:

Add Term Rank $r$ to VOCAB

Alternate Rank: words that appear the same number of times are given the same rank

Compute Zipf's $k$ using term_rank and term_rank2

Rank vs. N (frequency n)

As rank (term_rank2) increases, frequency (n) decreases
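Zipf's law says frequency is roughly inversely proportional to rank, so the product $k = r \times n$ should be roughly constant across terms. A minimal sketch of both ranking schemes (the pandas version in the notebook differs; `zipf_table` and the toy token list are illustrative):

```python
from collections import Counter

def zipf_table(tokens):
    """Rows of (term, term_rank, term_rank2, n, k) where k = term_rank * n.
    term_rank breaks frequency ties arbitrarily; term_rank2 gives tied
    frequencies the same (first) rank."""
    items = Counter(tokens).most_common()   # sorted by n, descending
    first_rank_at = {}                      # frequency -> first rank seen
    rows = []
    for rank, (term, n) in enumerate(items, start=1):
        r2 = first_rank_at.setdefault(n, rank)
        rows.append((term, rank, r2, n, rank * n))
    return rows

rows = zipf_table("a a a b b c c d".split())
```

Plotting rank against n on log-log axes then shows the characteristic near-linear Zipf decay.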

BOW (Bag of Words) and TFIDF (Term Frequency - Inverse Document Frequency)

Document-Term Count Matrix DTCM

Reduce the number of features in the VOCAB and TFIDF matrices to the 1000 most significant terms

"Collapse" the TFIDF matrix so that it contains mean TFIDF of each term by book.

Rank and TFIDF Mean

Rank and DFIDF

M06: Similarity and Clustering

Collapse Bags (to use for clustering)

Mean TFIDF for each book for all terms

Mean TFIDF for all books for the 1000 most significant terms only

DOC Table

Normalized Tables for Clustering

Create table of book pairs (doc pair table PAIRS)

Compute distance measures between all pairs of books using pdist()
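scipy's pdist() computes one distance per unordered pair in a condensed form; the idea can be sketched directly with two of the metrics typically compared (cosine and Euclidean). The helper names and the toy vectors are illustrative, not the notebook's code:

```python
import math
from itertools import combinations

def cosine_dist(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (nu * nv)

def euclidean_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pair_distances(docs, metric):
    """Distance for every unordered document pair, like pdist's condensed output
    but keyed by the pair of document labels."""
    return {(d1, d2): metric(docs[d1], docs[d2])
            for d1, d2 in combinations(sorted(docs), 2)}

docs = {"a": [1, 0], "b": [0, 1], "c": [2, 0]}
D = pair_distances(docs, cosine_dist)
```

Note that cosine distance ignores vector length ("a" and "c" point the same way, so their distance is 0), while Euclidean distance does not — which is why the two metrics can cluster the books differently.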

Compare Distributions

Hierarchical agglomerative cluster diagrams for the distance measures
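Hierarchical agglomerative clustering (as in scipy's linkage/dendrogram) repeatedly merges the two closest clusters. A naive single-linkage sketch — the function name, the pairwise-distance dict format, and the toy example are all illustrative assumptions:

```python
from itertools import combinations

def single_linkage(dist, items):
    """Naive agglomerative clustering: repeatedly merge the two clusters whose
    closest members are nearest (single linkage). dist maps sorted item pairs
    to distances; returns the merge history as (cluster, cluster, distance)."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        (a, b), d = min(
            (((c1, c2), min(dist[tuple(sorted((i, j)))]
                            for i in c1 for j in c2))
             for c1, c2 in combinations(clusters, 2)),
            key=lambda t: t[1])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b), d))
    return merges

merges = single_linkage(
    {("a", "b"): 1, ("a", "c"): 5, ("b", "c"): 4}, ["a", "b", "c"])
```

The merge history is exactly what a dendrogram draws: each merge is a horizontal bar at the height of its merge distance, so different distance measures can yield visibly different trees.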

Top 20 nouns by DFIDF, sorted in descending order (including plural nouns but not proper nouns)

Most "Significant" Book based on mean TFIDF

Compare Distributions

Compare Z normalized distributions
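Z-normalization rescales each distribution to mean 0 and standard deviation 1 so that books with very different overall TFIDF magnitudes become comparable. A minimal sketch (the notebook presumably does this column-wise on a DataFrame):

```python
import math

def z_normalize(values):
    """Standardize a sequence to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

z = z_normalize([1, 2, 3])
```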

K-Means

Algorithm Overview
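The K-means algorithm (Lloyd's iteration) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal sketch — this is a bare-bones stand-in, not sklearn's KMeans, and the toy points are illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # naive random initialization
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Update step: centroid = mean of its cluster (kept if cluster is empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, 2)
```

Production implementations add smarter initialization (k-means++) and convergence checks, but the two alternating steps are the whole idea.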

M07: Features and Components

Manual PCA Methods with Only 1000 Most Significant Terms (excluding proper nouns)
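The manual PCA recipe is: center the data, form the covariance matrix, eigen-decompose it, and sort components by eigenvalue (explained variance). A self-contained 2-D sketch using the closed form for a symmetric 2x2 matrix — the notebook's version works on the full TFIDF matrix with numpy, so this is only an illustration of the method:

```python
import math

def pca_2d(points):
    """Manual PCA for 2-D data: center, covariance, closed-form 2x2
    eigen-decomposition; returns (eigenvalues desc, first principal axis)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    xs = [x - mx for x, _ in points]
    ys = [y - my for _, y in points]
    sxx = sum(v * v for v in xs) / (n - 1)
    syy = sum(v * v for v in ys) / (n - 1)
    sxy = sum(a * b for a, b in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] via trace and determinant
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    # Eigenvector for the largest eigenvalue
    if sxy:
        v = (l1 - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (l1, l2), (v[0] / norm, v[1] / norm)

evals, pc1 = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```

Projecting the centered TFIDF rows onto the top eigenvectors gives the PC 0 / PC 1 coordinates plotted below; the Prince library wraps the same decomposition.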

Prince PCA Method with entire TFIDF

Note: there are outliers in the above plot that are skewing the results.

It seems as though the first principal component (PC 0) captures vernacular and the second (PC 1) captures German

The Adventures of Huckleberry Finn (76)

Mark Twain Speeches (3188)

The Tragedy of Pudd'nhead Wilson (102)

Note that the chapter numbers listed below are one greater than those in the book because the introduction is counted as the first chapter (e.g., Chapter 4 below is actually Chapter 3)

How to Tell A Story and Other Essays (3250)

Merry Tales (60900)

Sketches New and Old (3189)

Prince PCA Method with Outliers from above Removed

Prince PCA Method with Only 1000 Most Significant Words (TFIDF_sigs)

Sources